Scraping Engine

The IMDb Scraper uses a dual-source extraction strategy combining HTML parsing and GraphQL API calls to maximize data coverage and reliability.

Architecture Overview

The scraping engine is built around the ImdbScraper class, which implements the ScraperInterface and follows Clean Architecture principles:

class ImdbScraper(ScraperInterface):
    def __init__(
        self,
        use_case: UseCaseInterface,
        proxy_provider: ProxyProviderInterface,
        tor_rotator: TorInterface,
        engine: str,
        base_url: str = config.BASE_URL
    ):
        self.use_case = use_case
        self.proxy_provider = proxy_provider
        self.tor_rotator = tor_rotator
        self.engine = engine
        self.base_url = base_url

Location: infrastructure/scraper/imdb_scraper.py:21

Dual-Source Data Collection

HTML Parsing Strategy

The scraper extracts movie IDs from the IMDb Top 250 chart using CSS selectors:

# Extract movie IDs from HTML
html_ids = [
    a["href"].split("/")[2]
    for a in soup.select("td.titleColumn a")
    if "/title/" in a["href"]
]

Location: infrastructure/scraper/imdb_scraper.py:146
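
The split("/")[2] step relies on hrefs shaped like /title/tt0111161/. As a minimal sketch of that step in isolation, assuming title IDs are always "tt" followed by digits, a regular expression makes the expectation explicit and skips malformed links. The helper names extract_title_id and extract_title_ids are hypothetical and not part of the scraper:

```python
import re
from typing import List, Optional

# Assumption: IMDb title IDs look like "tt" followed by digits.
_TITLE_ID = re.compile(r"^/title/(tt\d+)(?:/|$|\?)")


def extract_title_id(href: str) -> Optional[str]:
    """Return the title ID from a relative href, or None if it is not a title link."""
    match = _TITLE_ID.match(href)
    return match.group(1) if match else None


def extract_title_ids(hrefs: List[str]) -> List[str]:
    """Keep only the hrefs that point at a title page, preserving order."""
    ids = (extract_title_id(h) for h in hrefs)
    return [i for i in ids if i is not None]
```

Unlike a bare split, this returns None for links such as /name/... instead of yielding an unrelated path segment.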

GraphQL API Integration

To supplement HTML data, the scraper queries IMDb’s GraphQL endpoint:

def _fetch_graphql_ids(self, cookies: Optional[requests.cookies.RequestsCookieJar]) -> List[str]:
    payload = {
        "operationName": config.GRAPHQL_OPERATION,
        "variables": { 
            "first": config.NUM_MOVIES, 
            "isInPace": False, 
            "locale": config.GRAPHQL_LOCALE 
        },
        "extensions": { 
            "persistedQuery": { 
                "sha256Hash": config.GRAPHQL_HASH, 
                "version": config.GRAPHQL_VERSION 
            } 
        }
    }

    response = make_request(
        url=config.GRAPHQL_URL,
        proxy_provider=self.proxy_provider,
        tor_rotator=self.tor_rotator,
        method="POST",
        json_payload=payload
    )

Location: infrastructure/scraper/imdb_scraper.py:158

GraphQL Configuration:

Endpoint: https://caching.graphql.imdb.com/
Operation: Top250MoviesPagination
Hash: 2db1d515844c69836ea8dc532d5bff27684fdce990c465ebf52d36d185a187b3
Locale: en-US
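
The excerpts above do not show how the HTML and GraphQL ID lists are combined. One plausible approach, offered only as a sketch, is an order-preserving union that drops duplicates and empty values. The function name merge_ids is hypothetical:

```python
from typing import Iterable, List


def merge_ids(html_ids: Iterable[str], graphql_ids: Iterable[str]) -> List[str]:
    """Union of both ID lists, keeping first-seen order and dropping duplicates."""
    seen = set()
    merged = []
    for imdb_id in [*html_ids, *graphql_ids]:
        # Skip empty strings and anything already collected from the other source.
        if imdb_id and imdb_id not in seen:
            seen.add(imdb_id)
            merged.append(imdb_id)
    return merged
```

Listing the HTML IDs first means the chart order from the page wins when both sources return the same title.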

BeautifulSoup Selectors

The engine uses CSS selectors configured in shared/config/config.py:

SELECTORS = {
    "title": '[data-testid="hero__primary-text"]',
    "year": 'ul.ipc-inline-list li a[href*="releaseinfo"]',
    "rating": '[data-testid="hero-rating-bar__aggregate-rating__score"] span',
    "duration_container": 'ul.ipc-inline-list--show-dividers',
    "metascore": "span.metacritic-score-box",
    "actors": "a[data-testid='title-cast-item__actor']"
}

Data Extraction Logic

# Title extraction
title_tag = soup.select_one(config.SELECTORS.get("title", ""))
title = title_tag.text.strip() if title_tag else ""

# Year extraction with regex validation
year_tag = soup.select_one(config.SELECTORS.get("year", ""))
year_str = year_tag.text.strip("()") if year_tag else "0"
year_match = re.search(r'\d{4}', year_str)
year = int(year_match.group()) if year_match else 0

# Rating extraction
rating_tag = soup.select_one(config.SELECTORS.get("rating", ""))
rating = float(rating_tag.text.strip()) if rating_tag else 0.0

# Metascore extraction (optional field)
metascore_tag = soup.select_one(config.SELECTORS.get("metascore", ""))
metascore = int(metascore_tag.text.strip()) if metascore_tag else None

Location: infrastructure/scraper/imdb_scraper.py:85
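
In the excerpt above, float() and int() raise ValueError when a tag is present but its text is not numeric. Whether the scraper relies on that exception is not shown here. As a hedged alternative, the conversions can be isolated in small helpers that fall back to the same defaults (0, 0.0, None). The helper names below are hypothetical:

```python
import re
from typing import Optional


def parse_year(text: str) -> int:
    """First four-digit run in the text, or 0 when none is present."""
    match = re.search(r"\d{4}", text)
    return int(match.group()) if match else 0


def parse_rating(text: str) -> float:
    """Rating as a float, or 0.0 when the text is empty or not numeric."""
    try:
        return float(text.strip())
    except ValueError:
        return 0.0


def parse_metascore(text: str) -> Optional[int]:
    """Metascore as an int, or None when the text is not a whole number."""
    try:
        return int(text.strip())
    except ValueError:
        return None
```

Keeping the parsing separate from the selector lookups also makes each rule testable without fetching a page.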

Duration Parsing

The scraper handles IMDb’s varied duration formats (e.g., “2h 30m”, “1h 45m”, “90m”):

duration = None
ul_list = soup.select(config.SELECTORS.get("duration_container", ""))
for ul in ul_list:
    for li in ul.find_all("li"):
        text = li.get_text(strip=True).lower()
        if re.search(r"(\d+h|\d+m)", text):
            hours_match = re.search(r"(\d+)h", text)
            minutes_match = re.search(r"(\d+)m", text)
            h = int(hours_match.group(1)) if hours_match else 0
            m = int(minutes_match.group(1)) if minutes_match else 0
            duration = (h * 60) + m
            break
    if duration:
        break

Location: infrastructure/scraper/imdb_scraper.py:100
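
The conversion at the heart of the loop above can be sketched as a standalone function using the same two regular expressions. The name parse_duration_minutes is hypothetical:

```python
import re
from typing import Optional


def parse_duration_minutes(text: str) -> Optional[int]:
    """Convert strings such as "2h 30m" or "90m" to minutes; None if no duration is found."""
    text = text.strip().lower()
    hours = re.search(r"(\d+)h", text)
    minutes = re.search(r"(\d+)m", text)
    if not hours and not minutes:
        return None
    h = int(hours.group(1)) if hours else 0
    m = int(minutes.group(1)) if minutes else 0
    return h * 60 + m
```

Returning None rather than 0 keeps "no duration found" distinct from a zero-minute value, a distinction the truthiness check `if duration:` in the excerpt does not make.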

Actor Extraction

The scraper keeps the first three cast entries matched by the actors selector on each movie page, dropping any whose name is empty:

cast_tags = soup.select(config.SELECTORS.get("actors", ""))[:3]
actors = [
    Actor(id=None, name=cast.text.strip())
    for cast in cast_tags if cast.text.strip()
]

Location: infrastructure/scraper/imdb_scraper.py:115

Error Handling & Retry Logic

Robust Request Handling

All HTTP requests go through the make_request utility, which retries failed requests with increasing delays:

response = make_request(
    url=detail_url,
    proxy_provider=self.proxy_provider,
    tor_rotator=self.tor_rotator
)

if not response:
    # Log message: "Could not get a response for URL: {detail_url}"
    logger.warning(f"No se pudo obtener respuesta para la URL: {detail_url}")
    return None

Location: infrastructure/scraper/imdb_scraper.py:71

Retry Configuration

MAX_RETRIES = 3
RETRY_DELAYS = [1, 3, 5]  # Backoff delays in seconds (increasing, not strictly exponential)
REQUEST_TIMEOUT = 10
BLOCK_CODES = [202, 403, 404, 429, 500]

Location: shared/config/config.py:50
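
The internals of make_request are not shown on this page. As a rough sketch of how these settings could drive a retry loop, the example below makes one attempt per configured delay and treats any status in BLOCK_CODES as a failure. The Sender type, the fetch_with_retries name, and the injectable sleep parameter are all assumptions made for illustration:

```python
import time
from typing import Callable, Optional, Sequence, Tuple

RETRY_DELAYS = [1, 3, 5]                 # seconds, as in the configuration above
BLOCK_CODES = {202, 403, 404, 429, 500}  # statuses treated as "blocked"

# Hypothetical transport: takes a URL, returns (status_code, body).
Sender = Callable[[str], Tuple[int, bytes]]


def fetch_with_retries(
    url: str,
    send: Sender,
    delays: Sequence[float] = RETRY_DELAYS,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[bytes]:
    """Make one attempt per configured delay; return the body, or None if all fail."""
    for attempt, delay in enumerate(delays):
        try:
            status, body = send(url)
            if status not in BLOCK_CODES:
                return body
        except OSError:
            pass  # network-level failure: handle like a blocked response
        if attempt < len(delays) - 1:
            sleep(delay)  # wait before the next attempt, but not after the last one
    return None
```

Passing sleep as a parameter keeps the loop testable without real waiting.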

Fallback Strategy

The request utility implements a multi-layer fallback:

Primary: Premium proxy (DataImpulse)
Fallback: TOR network with IP rotation
Final: Direct connection through VPN

Location: infrastructure/scraper/utils.py:34
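
The ordering above amounts to trying each route in turn and stopping at the first success. A minimal sketch of that pattern follows, assuming each route is a callable that returns the response body or None on failure. The Route type and the fetch_with_fallback name are hypothetical, and the actual utility may be structured differently:

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical route: a label plus a callable that returns the body or None on failure.
Route = Tuple[str, Callable[[str], Optional[bytes]]]


def fetch_with_fallback(
    url: str, routes: Sequence[Route]
) -> Tuple[Optional[str], Optional[bytes]]:
    """Try each route in order and return (label, body) for the first that succeeds."""
    for label, fetch in routes:
        try:
            body = fetch(url)
        except OSError:
            body = None  # a failing route should not stop the chain
        if body is not None:
            return label, body
    return None, None
```

Returning the label alongside the body makes it easy to log which layer served each request.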

Data Validation

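Each scraped movie is handed to the use case for persistence. A ValueError raised while scraping or saving marks the data as invalid: it is logged as a warning and the movie is skipped. Any other exception is logged as an error with its traceback: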

try:
    movie = self._scrape_movie_detail(indexed_id)
    if movie:
        self.use_case.execute(movie)
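except ValueError as e:
    # Log message: "Invalid data for {imdb_id}: {e}. Skipping save."
    logger.warning(f"Datos inválidos para {imdb_id}: {e}. Saltando guardado.")
except Exception as e:
    # Log message: "Unexpected error while processing and saving {imdb_id}: {e}"
    logger.error(f"Error inesperado al procesar y guardar {imdb_id}: {e}", exc_info=True)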

Location: infrastructure/scraper/imdb_scraper.py:58
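
The page does not show where the ValueError originates. One common arrangement, sketched here purely as an assumption, is for the entity itself to reject placeholder values at construction time. MovieRecord is a hypothetical stand-in for the project's movie entity, and the specific rules are illustrative:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MovieRecord:
    """Hypothetical stand-in for the project's movie entity."""

    imdb_id: str
    title: str
    year: int
    rating: float
    metascore: Optional[int] = None

    def __post_init__(self) -> None:
        # The extraction code falls back to "", 0 and 0.0, so those values mean "missing".
        if not self.title:
            raise ValueError("title is empty")
        if self.year <= 0:
            raise ValueError("year is missing")
        if not 0.0 < self.rating <= 10.0:
            raise ValueError(f"rating out of range: {self.rating}")
        if self.metascore is not None and not 0 <= self.metascore <= 100:
            raise ValueError(f"metascore out of range: {self.metascore}")
```

With checks like these, a page whose selectors matched nothing would raise ValueError and be skipped by the handler above rather than stored with default values.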

Traffic Monitoring

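The scraper tracks bandwidth usage by adding the size of each response body to a running total, which it logs in megabytes when the run completes: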

self.total_bytes_used += len(response.content)

# At completion (log message: "Total traffic used: ... MB"):
logger.info(f"Tráfico total usado: {self.total_bytes_used / (1024 ** 2):.2f} MB")

Location: infrastructure/scraper/imdb_scraper.py:81
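
If the counter is shared across worker threads, an unguarded += can lose updates, because the read and write are separate steps. Whether the scraper guards this is not shown here. A lock-protected counter is one way to keep the total exact. TrafficMeter is a hypothetical helper, not part of the scraper:

```python
import threading


class TrafficMeter:
    """Thread-safe byte counter (hypothetical helper, not part of the scraper)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._total = 0

    def add(self, num_bytes: int) -> None:
        """Record the size of one response body."""
        with self._lock:
            self._total += num_bytes

    @property
    def megabytes(self) -> float:
        """Total traffic so far, using the same 1024 ** 2 divisor as the log line."""
        with self._lock:
            return self._total / (1024 ** 2)
```

Each worker would call meter.add(len(response.content)) after a request, and the final log line would read meter.megabytes.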

Configuration Options

Key configuration options in shared/config/config.py:

# Scraping parameters
BASE_URL = "https://www.imdb.com"
CHART_TOP_PATH = "/chart/top/"
TITLE_DETAIL_PATH = "/title/{id}/"
NUM_MOVIES = 250

# Request settings
REQUEST_TIMEOUT = 10
MAX_RETRIES = 3
RETRY_DELAYS = [1, 3, 5]

# Concurrency
MAX_THREADS = 50

# User-Agent rotation
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/88.0.4324.96 Safari/537.36",
    "Mozilla/5.0 (Linux; Android 6.0; Nexus 5) AppleWebKit/537.36 Chrome/90.0.4430.91 Mobile Safari/537.36"
]
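
As an illustration of how these settings might be consumed, the sketch below builds a detail-page URL from BASE_URL and TITLE_DETAIL_PATH and picks a random User-Agent per request. The helper names build_detail_url and random_headers are hypothetical, and only two of the configured User-Agent strings are repeated to keep the example short:

```python
import random
from typing import Dict, Optional

BASE_URL = "https://www.imdb.com"
TITLE_DETAIL_PATH = "/title/{id}/"
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/88.0.4324.96 Safari/537.36",
]


def build_detail_url(imdb_id: str) -> str:
    """Fill the {id} placeholder and prepend the base URL."""
    return BASE_URL + TITLE_DETAIL_PATH.format(id=imdb_id)


def random_headers(rng: Optional[random.Random] = None) -> Dict[str, str]:
    """Pick one configured User-Agent; pass a seeded Random for repeatable runs."""
    chooser = rng or random
    return {"User-Agent": chooser.choice(USER_AGENTS)}
```

Accepting an optional random.Random instance lets tests fix the seed while production code uses the module-level generator.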

Next Steps

Network Evasion

Concurrency
